Automatic linguistic annotation of historical language: ToTrTaLe and XIX century Slovene

Author

  • Tomaz Erjavec
Abstract

The paper describes a tool developed to process historical (Slovene) text, which annotates words in a TEI-encoded corpus with their modern-day equivalents, morphosyntactic tags and lemmas. Such a tool is useful for developing historical corpora of highly inflecting languages, enabling full-text search in digital libraries of historical texts, modernising such texts for today's readers, and making OCR transcriptions easier to correct.

Similar resources

A Web Service Implementation of Linguistic Annotation for Slovene and English

This paper presents a web service for automatic linguistic annotation of Slovene and English texts. The texts are tokenised, morphosyntactically tagged and lemmatised by the ToTrTaLe annotation tool, while the web service for this annotation is made available in the Orange4WS and the ClowdFlows workflow construction environments. The workflows enable the users to apply the annotation tool as an...

NLP Web Services for Slovene and English: Morphosyntactic Tagging, Lemmatisation and Definition Extraction

This paper presents a web service for automatic linguistic annotation of Slovene and English texts. The web service enables text up-loading in a number of different input formats, and then converts, tokenises, tags and lemmatises the text, and returns the annotated text. The paper presents the ToTrTaLe annotation tool, and the implementation of the annotation workflow in two workflow constructi...

The position of Persian language and literature in Ottoman’s 19th century literature and historical developments

With the spread of western reforms in the 13th/19th century, Ottoman literature was reformed as well. To reform Ottoman literature, they decided to transform the relations of the Ottoman language and literature with Persian language and literature. On one hand, they considered problems of Ottoman literature regarding Pindaric and its inefficiency for entering new areas such as novel, drama, and journa...

An Architecture for Editing Complex Digital Documents

In several on-going projects we were faced with the dilemma of how to reconcile our goal of delivering standardly encoded digital editions, yet have the actual editing and annotation performed by researchers and students who had no knowledge of XML and the Text Encoding Initiative Guidelines (TEI), and, for the most part, no great interest in learning them. The developed solution consists of al...

The JOS Linguistically Tagged Corpus of Slovene

The JOS language resources are meant to facilitate developments of HLT and corpus linguistics for the Slovene language and consist of the morphosyntactic specifications, defining the Slovene morphosyntactic features and tagset; two annotated corpora (jos100k and jos1M); and two web services (a concordancer and text annotation tool). The paper introduces these components, and concentrates on jos...

Publication date: 2011